The Skill IQ and Role IQ tests are addictive. I haven’t used Pluralsight to learn and improve my technical skills yet, but I can see how the assessments would drive engagement and keep subscribers coming back to improve. What a fun way to encourage personal development.

Data Exploration Questions

1. Describe and visualize how the distributions of user and question rankings compare and relate between assessments.

User Ranking Distributions

Overall Ranking Metrics

Interaction Ranking Metrics

Question Ranking Distributions

Question Ranking Metrics by Assessment

2. How does it appear the algorithm determines when a user’s assessment session is complete?

We can investigate how the algorithm decides to stop asking questions by looking at a time series of each assessment session. The obvious guess is a minimum threshold on the question-to-question changes in the rd value. Something very similar to that guess is confirmed by observing a random sample of several user_assessment_session_ids.

It’s probably worth checking the other metrics associated with a session (display_score, percentile, and ranking) to confirm our suspicion that rd is the main variable driving the algorithm. Per the plots below of the same three assessment sessions, rd is the only one of the four metrics that looks like an appropriate candidate.

A closer look at the distribution of each session’s minimum rd value shows that a simple threshold of 80 drives the stopping rule: over 75% of the sessions were stopped at an rd value just below 80. While that seems like an arbitrary value to me, I am sure some empirical and theoretical studies were performed to determine that threshold. Also, 75% may seem low, but that figure includes all sessions, even those that were stopped prematurely by the user (as discussed in #3).

##        0%        5%       10%       15%       20%       25%       30% 
##  77.99380  78.34009  78.42960  78.51234  78.59056  78.65580  78.70929 
##       35%       40%       45%       50%       55%       60%       65% 
##  78.76540  78.82022  78.87147  78.93500  79.00305  79.08484  79.19052 
##       70%       75%       80%       85%       90%       95%      100% 
##  79.34391  79.65148  94.55070 124.50190 156.89920 202.17270 256.61200
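The suspected rule is easy to sketch. A minimal version (in Python here, with a made-up rd sequence rather than real session data) of the stopping behavior described above:

```python
# Suspected stopping rule: keep asking questions while rd stays at or above
# a threshold of 80; stop the session once rd drops below it.
RD_THRESHOLD = 80.0  # inferred from the quantiles above; an assumption

def questions_until_stop(rd_sequence, threshold=RD_THRESHOLD):
    """Return how many questions are asked before rd drops below threshold."""
    for i, rd in enumerate(rd_sequence, start=1):
        if rd < threshold:
            return i
    return len(rd_sequence)  # user quit (or data ended) before converging

# Invented session: rd shrinks with each answer until it crosses 80.
example_rds = [350.0, 240.0, 180.0, 140.0, 110.0, 92.0, 84.0, 79.2]
print(questions_until_stop(example_rds))  # 8
```

A session that never crosses the threshold is exactly the premature-stop case discussed in #3.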

3. Which of the assessments has the highest and lowest dropout rates, respectively?

## # A tibble: 2 x 2
##   rd_threshold `n()`
##          <dbl> <int>
## 1            0  1608
## 2            1  5070
## # A tibble: 32 x 3
## # Groups:   rd_threshold [?]
##    rd_threshold n_questions_answered `n()`
##           <dbl>                <int> <int>
##  1            0                    0   271
##  2            0                    1   211
##  3            0                    2   189
##  4            0                    3   165
##  5            0                    4   112
##  6            0                    5   127
##  7            0                    6   101
##  8            0                    7   100
##  9            0                    8    70
## 10            0                    9    54
## # ... with 22 more rows
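The counts above can be turned into per-assessment dropout rates. A Python sketch, assuming a session counts as a dropout if its minimum rd never fell below the stopping threshold of 80 (the session records below are invented):

```python
# Dropout rate per assessment: share of sessions where the user quit before
# the algorithm reached its rd-below-80 stopping condition.
from collections import defaultdict

def dropout_rates(sessions, threshold=80.0):
    """sessions: iterable of (assessment_name, min_rd) pairs."""
    totals = defaultdict(int)
    dropouts = defaultdict(int)
    for assessment, min_rd in sessions:
        totals[assessment] += 1
        if min_rd >= threshold:   # rd never converged -> user dropped out
            dropouts[assessment] += 1
    return {a: dropouts[a] / totals[a] for a in totals}

fake_sessions = [("Python", 78.5), ("Python", 150.0), ("CSS", 79.1), ("CSS", 79.3)]
print(dropout_rates(fake_sessions))  # {'Python': 0.5, 'CSS': 0.0}
```

Sorting that dict by value would surface the highest- and lowest-dropout assessments directly.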

4. Is there significant variance in question difficulty by topic within a given assessment?

5. How many times must a question be answered before it reaches its certainty floor? Does that number appear to be constant or does it vary depending on question or assessment?

I think there are 724 questions in the data.

## [1] 724

I think I’ll have to look at all 724 of these using trelliscopejs. It should be an interesting view and help me see whether roughly 10 answers per question is sufficient or whether it takes more.
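Alongside eyeballing the trelliscope panels, the same check can be scripted: for each question, count how many answers it takes before its uncertainty stops moving by more than some small epsilon. A Python sketch with illustrative (not real) values:

```python
# Count answers until a question's uncertainty value reaches its floor,
# i.e. until successive changes stay below a small epsilon.
def answers_to_floor(rd_values, epsilon=0.5):
    """Number of answers until successive rd changes fall below epsilon."""
    for i in range(1, len(rd_values)):
        if abs(rd_values[i] - rd_values[i - 1]) < epsilon:
            return i
    return len(rd_values)  # floor not reached in the observed answers

# Invented per-question uncertainty trajectory; epsilon=0.5 is an assumption.
fake_question_rd = [120.0, 95.0, 88.0, 84.0, 83.0, 82.8, 82.7]
print(answers_to_floor(fake_question_rd))  # 5
```

Running this over all 724 questions, grouped by assessment, would show whether the count is roughly constant or varies by question or topic.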

More Involved/Open-ended Questions

1. Identify a metric that could be used to identify questions that are performing poorly, and consequently might need to be reviewed, changed, or removed.

  • Questions that are almost always answered incorrectly, especially when the question’s difficulty is comparatively low. (Some questions are likely difficult on purpose, so one expects those to rarely receive a correct response.)
  • Questions that increase the RD metric substantially (though that may be a function of question order).
  • (I’m thinking of a scatterplot comparing the rd change due to a question against the user’s current percentile; questions would stand out as outliers where the rd change is large and negative while the percentile is low.)
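The first bullet can be sketched as a simple screen: flag questions whose observed correct-rate sits far below what their assigned difficulty would suggest. The expected-rate mapping and the sample records below are assumptions for illustration, not the real scoring model:

```python
# Flag questions answered incorrectly far more often than their difficulty
# would predict. difficulty is scaled 0 (easy) to 1 (hard) here by assumption.
def flag_suspect_questions(stats, gap=0.30):
    """stats: list of (question_id, difficulty_0_to_1, correct_rate)."""
    flagged = []
    for qid, difficulty, correct_rate in stats:
        expected = 1.0 - difficulty  # easy questions should usually be answered correctly
        if expected - correct_rate > gap:
            flagged.append(qid)
    return flagged

fake_stats = [("q1", 0.2, 0.15),  # easy but rarely correct -> suspicious
              ("q2", 0.9, 0.05),  # hard and rarely correct -> expected
              ("q3", 0.3, 0.65)]
print(flag_suspect_questions(fake_stats))  # ['q1']
```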

2. Suppose an update to Python causes a question’s answer to change, but our question authors don’t notice, and the now-outdated question remains in the test. How might that scenario reveal itself in the data?

Hopefully it reveals itself as a question that is answered incorrectly far more often than expected. That may not be true of more experienced or long-time users of that technology/language, so one might need to account for that somehow. I noticed a link at the bottom of the page, after the answer is revealed, that gives users an opportunity to report exactly this kind of situation.
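Since the answer only changed when the update shipped, the signal would be a drop in the correct-rate over time rather than a uniformly low rate. A crude before/after split on answer timestamps (the records and cutoff below are invented) would expose it:

```python
# Detect a correct-rate drop around a suspected change date: a large positive
# shift suggests the question's accepted answer went stale at the cutoff.
def correct_rate_shift(answers, cutoff):
    """answers: list of (timestamp, was_correct); cutoff: suspected update date."""
    before = [c for t, c in answers if t < cutoff]
    after = [c for t, c in answers if t >= cutoff]
    if not before or not after:
        return 0.0  # not enough history on one side to compare
    return sum(before) / len(before) - sum(after) / len(after)

fake_answers = [(1, True), (2, True), (3, False), (4, True),
                (10, False), (11, False), (12, True), (13, False)]
print(correct_rate_shift(fake_answers, cutoff=10))  # 0.5
```

In practice the cutoff wouldn’t be known in advance, so one would scan candidate cutoffs (or use a changepoint method) per question.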

3. Given your response to number 2 in the Data Exploration Questions above, what is a method we could use to determine ideal points to stop a user’s assessment session (i.e. identify the right balance between certainty and burden on the user)?

I suppose you could try to account for the distribution/curve of a user’s previous assessments. For example, if they have taken several assessments before the current one, you may be able to predict/extrapolate the end score and ranking from their position partway through the assessment. Taking that a step further, why not treat each step of an assessment (for a given topic) as a modeling and prediction opportunity by training a predictive model (deep learning or otherwise) on the eventual outcomes of past assessments? That way you could use the thousands (or millions) of assessments for that topic to generate a prediction, and stop the assessment once the prediction has reached a certain threshold of accuracy per the model. To be clear, I am imagining a separate model for each number of questions answered within a given topic: one model trained on sessions with five questions answered, another with six, and so on.
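A toy version of the one-model-per-question-count idea, using a closed-form least-squares line instead of a deep network (all numbers below are simulated; the real version would train on historical sessions per topic):

```python
# For each question count k, fit a line predicting the final score from the
# running score after k questions; a live session can stop once the model's
# prediction is tight enough.
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Simulated historical sessions: running score after 5 questions -> final score.
after_5 = [100.0, 120.0, 140.0, 160.0]
finals = [110.0, 130.0, 150.0, 170.0]
slope, intercept = fit_line(after_5, finals)
print(round(slope, 2), round(intercept, 2))  # 1.0 10.0
```

The same fit would be repeated for k = 6, 7, …, and the stopping decision would compare each model’s residual spread against the certainty threshold.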

4. How could we calculate the overall difficulty level of a particular topic? How might we then calculate a topic-level score for a single user?

You may get close by determining which combinations of topics tend to be taken together. If a set of users is prone to take the same five topic assessments (and rarely others), you could look at which topic was the most difficult for that group. As an example, business analysts may consistently take the data warehousing, data analytics/visualization, SQL, and Python assessments, and often score lower on the Python assessment. I wonder if the frequency with which a topic is assessed is an indicator of its difficulty. Certainly the frequency relates to the popularity and the general demand/usefulness of the topic, as well as its newness (newer tools/tech/languages may be taken less frequently, following an adoption curve). Fortran or other older languages/technologies may be considered more difficult simply because fewer modern learning resources exist for them. How is “difficult” defined here?
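The within-group comparison can be made concrete: for users who took several topics, compare each topic score to that user’s own average, and call topics that sit consistently below users’ averages harder. A Python sketch with invented score records:

```python
# Relative topic difficulty: average of (user's score on a topic minus that
# user's mean score across all their topics). Lower values = harder topic.
from collections import defaultdict

def relative_topic_difficulty(scores):
    """scores: list of (user, topic, score) tuples."""
    by_user = defaultdict(list)
    for user, _, score in scores:
        by_user[user].append(score)
    user_mean = {u: sum(v) / len(v) for u, v in by_user.items()}
    deltas = defaultdict(list)
    for user, topic, score in scores:
        deltas[topic].append(score - user_mean[user])
    return {t: sum(v) / len(v) for t, v in deltas.items()}

fake_scores = [("a", "SQL", 160), ("a", "Python", 120),
               ("b", "SQL", 150), ("b", "Python", 110)]
print(relative_topic_difficulty(fake_scores))  # {'SQL': 20.0, 'Python': -20.0}
```

A user’s topic-level score could then be reported relative to that baseline, which partially controls for who chooses to take which assessments.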